
    VAT tax gap prediction: a 2-steps Gradient Boosting approach

    Tax evasion is the illegal non-payment or underpayment of taxes by individuals, corporations, and trusts. The resulting revenue loss can undermine the effectiveness and equity of government policies. A standard measure of tax evasion is the tax gap, which can be estimated as the difference between the total amount of tax theoretically collectable and the total amount of tax actually collected in a given period. This paper presents an original contribution to the bottom-up approach, based on results from fiscal audits, through the use of Machine Learning. The major disadvantage of bottom-up approaches is selection bias, which arises when audited taxpayers are not randomly selected, as in the case of audits performed by the Italian Revenue Agency. Our proposal, based on a two-step Gradient Boosting model, produces a robust tax gap estimate and embeds a correction for the selection bias that does not require any assumption on the underlying data distribution. The two-step Gradient Boosting approach is used to estimate the Italian Value-Added Tax (VAT) gap for individual firms on the basis of fiscal and administrative data from income tax returns gathered from the Tax Administration database, for the fiscal year 2011. The proposed method significantly boosts predictive performance with respect to classical parametric approaches. Comment: 27 pages, 4 figures, 8 tables. Presented at the NTTS 2019 conference. Under review at a peer-reviewed journal.
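    The two-step structure described in the abstract (first model the non-random audit selection, then predict the tax gap on the audited sample while correcting for selection bias) can be illustrated as follows. This is a minimal sketch on synthetic data; the inverse-probability weighting, the variable names, and the default hyperparameters are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                # firm-level covariates from tax returns
audit_p = 1 / (1 + np.exp(-X[:, 0]))       # non-random audit selection mechanism
audited = rng.random(n) < audit_p
tax_gap = np.maximum(0, X[:, 1] + 0.5 * X[:, 0] + rng.normal(size=n))

# Step 1: model the probability of being audited (the selection mechanism)
sel = GradientBoostingClassifier().fit(X, audited)
p_hat = np.clip(sel.predict_proba(X)[:, 1], 1e-3, 1.0)

# Step 2: regress the tax gap on audited firms only, reweighting by inverse
# selection probabilities to correct for the non-random audited sample
reg = GradientBoostingRegressor().fit(
    X[audited], tax_gap[audited], sample_weight=1 / p_hat[audited]
)
predicted_gap = reg.predict(X)             # tax gap prediction for every firm
```

    The reweighting step is one standard way to undo selection bias without distributional assumptions; summing `predicted_gap` over all firms would then give an aggregate gap estimate.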

    A non-parametric Hawkes process model of primary and secondary accidents on a UK smart motorway

    A self-exciting spatio-temporal point process is fitted to incident data from the UK National Traffic Information Service to model the rates of primary and secondary accidents on the M25 motorway in a 12-month period during 2017-18. This process uses a background component to represent primary accidents, and a self-exciting component to represent secondary accidents. The background consists of periodic daily and weekly components, a spatial component and a long-term trend. The self-exciting components are decaying, unidirectional functions of space and time. These components are determined via kernel smoothing and likelihood estimation. Temporally, the background is stable across seasons with a daily double peak structure reflecting commuting patterns. Spatially, there are two peaks in intensity, one of which becomes more pronounced during the study period. Self-excitation accounts for 6-7% of the data with associated time and length scales around 100 minutes and 1 kilometre respectively. In-sample and out-of-sample validation are performed to assess the model fit. When we restrict the data to incidents that resulted in large speed drops on the network, the results remain coherent.
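    The intensity decomposition described above (a background rate plus decaying, unidirectional excitation kernels in space and time) can be sketched as follows. The constant background, the exponential kernels, the upstream-only convention, and all parameter values are illustrative assumptions; the paper estimates these components non-parametrically via kernel smoothing.

```python
import numpy as np

def intensity(t, x, events, mu=0.1, alpha=0.3, beta=1 / 100, gamma=1.0):
    """Self-exciting intensity at time t (minutes) and location x (km).

    The background mu is constant here for simplicity; in the model it
    has daily/weekly periodic, spatial, and long-term trend components.
    Excitation is unidirectional: an accident at (t_i, x_i) only raises
    the rate at later times and upstream locations (x <= x_i, an
    illustrative convention), decaying in both time and space.
    """
    lam = mu
    for t_i, x_i in events:
        if t_i < t and x <= x_i:           # decaying, one-directional kernel
            lam += (alpha * beta * np.exp(-beta * (t - t_i))
                    * gamma * np.exp(-gamma * (x_i - x)))
    return lam
```

    The time scale 1/beta = 100 minutes and length scale 1/gamma = 1 km mirror the scales reported in the abstract; the excitation weight alpha would correspond to the 6-7% share of secondary accidents.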

    Bayesian hierarchical modeling and analysis for physical activity trajectories using actigraph data

    Rapid developments in streaming data technologies are continuing to generate increased interest in monitoring human activity. Wearable devices, such as wrist-worn sensors that monitor gross motor activity (actigraphy), have become prevalent. An actigraph unit continually records the activity level of an individual, producing a very large amount of data at high resolution that can be immediately downloaded and analyzed. While this kind of "big data" includes both spatial and temporal information, the variation in such data seems to be more appropriately modeled by considering stochastic evolution through time while accounting for spatial information separately. We propose a comprehensive Bayesian hierarchical modeling and inferential framework for actigraphy data, reckoning with the massive sizes of such databases while attempting to offer full inference. Building upon recent developments in this field, we construct Nearest Neighbour Gaussian Processes (NNGPs) for actigraphy data to compute at large temporal scales. More specifically, we construct a temporal NNGP and focus on the optimized implementation of the collapsed algorithm in this specific context. This approach permits improved model scaling while also offering full inference. We test and validate our methods on simulated data and subsequently apply and verify their predictive ability on an original dataset concerning a health study conducted by the Fielding School of Public Health of the University of California, Los Angeles.
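    The computational idea behind a temporal NNGP, conditioning each observation only on a small set of nearest preceding time points so that the joint density factorizes cheaply, can be sketched as follows. The exponential covariance and the mean-zero assumption are illustrative choices, not necessarily the paper's; the point is the O(n m^3) cost in place of O(n^3).

```python
import numpy as np

def nngp_log_density(y, times, m=5, sigma2=1.0, phi=1.0):
    """Log-density of y under a temporal NNGP approximation.

    times must be sorted.  Each observation conditions only on its m
    nearest *preceding* time points, so the cost is O(n m^3) rather
    than O(n^3).  The exponential covariance
    C(s, t) = sigma2 * exp(-phi * |s - t|) is an illustrative choice.
    """
    n = len(y)
    cov = lambda s, t: sigma2 * np.exp(-phi * np.abs(s - t))
    logdens = 0.0
    for i in range(n):
        nb = list(range(max(0, i - m), i))       # neighbor set N(i)
        if not nb:
            mu, v = 0.0, sigma2                  # first point: marginal
        else:
            C_nn = np.array([[cov(times[j], times[k]) for k in nb] for j in nb])
            c_in = np.array([cov(times[i], times[j]) for j in nb])
            w = np.linalg.solve(C_nn, c_in)      # kriging weights
            mu = w @ y[nb]                       # conditional mean
            v = sigma2 - w @ c_in                # conditional variance
        logdens += -0.5 * (np.log(2 * np.pi * v) + (y[i] - mu) ** 2 / v)
    return logdens
```

    When m covers all preceding points the factorization is exact (it is just the Gaussian chain rule); truncating to a few neighbors is what makes the approximation scalable.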

    Finite mixtures in capture-recapture surveys for modelling residency patterns in marine wildlife populations

    In this work, the goal is to estimate the abundance of an animal population using data coming from capture-recapture surveys. We leverage prior knowledge about the population's structure to specify a parsimonious finite mixture model tailored to its behavioral pattern. Inference is carried out in the Bayesian framework, where we discuss suitable prior specifications that can alleviate the label-switching and non-identifiability issues affecting finite mixtures. We conduct simulation experiments to show the competitive advantage of our proposal over less specific alternatives. Finally, the proposed model is used to estimate the size of the common bottlenose dolphin population at the Tiber River estuary (Mediterranean Sea), using data collected via photo-identification from 2018 to 2020. Results provide novel insights into the population's size and structure, and shed light on some of the ecological processes governing the population dynamics.
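    As a sketch of the kind of parsimonious mixture underlying such models, consider capture histories generated by a small number of latent behavioral classes (say, residents with high capture probability and transients with low). The two-class Bernoulli structure below is a simplified illustration of the mixture likelihood, not the paper's exact specification.

```python
import numpy as np

def mixture_loglik(histories, pi, p):
    """Log-likelihood of capture histories under a K-class mixture.

    histories : (n, T) 0/1 array of sightings over T survey occasions
    pi        : (K,) mixing weights (e.g. resident vs. transient share)
    p         : (K,) per-occasion capture probability of each class

    Each animal's class is latent, so its history's likelihood is a
    pi-weighted sum over classes.  Conditioning on first capture and
    other refinements are ignored in this sketch.
    """
    n, T = histories.shape
    ll = 0.0
    for h in histories:
        x = h.sum()                                # number of captures
        comp = pi * p ** x * (1 - p) ** (T - x)    # class-wise likelihoods
        ll += np.log(comp.sum())
    return ll
```

    In a Bayesian fit, priors on pi and p that separate the classes (e.g. ordering the capture probabilities) are one common way to mitigate the label-switching the abstract mentions.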

    Nowcasting COVID-19 incidence indicators during the Italian first outbreak

    A novel parametric regression model is proposed to fit incidence data typically collected during epidemics. The proposal is motivated by real-time monitoring and short-term forecasting of the main epidemiological indicators within the first outbreak of COVID-19 in Italy. Accurate short-term predictions, including the potential effect of exogenous or external variables, are provided. This makes it possible to accurately predict important characteristics of the epidemic (e.g., peak time and height), allowing for a better allocation of health resources over time. Parameter estimation is carried out in a maximum likelihood framework. All computational details required to reproduce the approach and replicate the results are provided.
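    As an illustration of fitting a parametric incidence curve by maximum likelihood, the sketch below fits daily counts with a Poisson likelihood whose mean is the increment of a logistic cumulative curve. The logistic family, the Poisson assumption, and the starting values are illustrative assumptions; the abstract does not specify the paper's exact parametric form.

```python
import numpy as np
from scipy.optimize import minimize

def logistic_cum(t, K, r, t0):
    """Cumulative cases: final size K, growth rate r, inflection time t0."""
    return K / (1 + np.exp(-r * (t - t0)))

def fit_incidence(t, counts):
    """Poisson ML fit of daily incidence derived from a logistic
    cumulative curve (an illustrative parametric family)."""
    def nll(theta):
        K, r, t0 = theta
        # daily expected incidence = increments of the cumulative curve
        lam = np.diff(logistic_cum(np.r_[t[0] - 1.0, t], K, r, t0))
        lam = np.maximum(lam, 1e-9)
        return np.sum(lam - counts * np.log(lam))  # Poisson neg. log-lik.
    res = minimize(nll, x0=[counts.sum() * 2.0, 0.2, np.median(t)],
                   method="Nelder-Mead")
    return res.x  # estimated (K, r, t0)
```

    Once fitted, peak time and height follow directly from the estimated parameters (the incidence of a logistic curve peaks at t0 with height close to K*r/4), which is the kind of summary the abstract says is needed for allocating health resources.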

    Covid‐19 in Italy: Modelling, communications, and collaborations

    When Covid-19 arrived in Italy in early 2020, a group of statisticians came together to provide tools to make sense of the unfolding epidemic and to counter misleading media narratives. Here, members of StatGroup-19 reflect on their work to date.

    Features of Primary Chronic Headache in Children and Adolescents and Validity of ICHD 3 Criteria

    Introduction: Chronic headaches are not a rare condition in children and adolescents and have negative effects on their quality of life. Our aims were to investigate the clinical features of chronic headache and the usefulness of the International Classification of Headache Disorders 3rd edition (ICHD 3) criteria for diagnosis in a cohort of pediatric patients. Methods: We retrospectively reviewed the charts of patients attending the Headache Center of Bambino Gesù Children's Hospital and Insubria University Hospital during the 2010-2016 time interval. Statistical analysis was conducted to study possible correlations between: (a) chronic primary headache (CPH) and demographic data (age and sex), (b) CPH and headache qualitative features, (c) CPH and risk of medication overuse headache (MOH), and (d) CPH and response to prophylactic therapies. Moreover, we compared the diagnoses obtained by ICHD 3 vs. ICHD 2 criteria. Results: We included 377 patients with CPH (66.4% females, 33.6% males, under 18 years of age). CPH was less frequent under 6 years of age (0.8%; p < 0.05) and there was no correlation between age/sex and the different CPH types. The risk of developing MOH was higher after 15 years of age (p < 0.05). When we compared the diagnoses obtained by ICHD 2 and ICHD 3 criteria, we found a significant difference for undefined diagnoses (2.6% vs. 7.9%; p < 0.05), while the diagnosis of probable chronic migraine was only possible using the ICHD 2 criteria (11.9% of patients; p < 0.05). The main criterion that was not satisfied for a definitive diagnosis was an attack duration of less than 2 h (70% of patients younger than 6 years; p < 0.005). Amitriptyline and topiramate were the most effective drugs (p < 0.05), although no significant difference was found between them (p > 0.05). Conclusion: The ICHD 3 criteria show limitations when applied to children under 6 years of age. The risk of developing MOH increases with age. Although our "real world" study shows that amitriptyline and topiramate are the most effective drugs regardless of CPH type, the lack of placebo-controlled data and the limited follow-up results did not allow us to draw firm conclusions about drug efficacy.

    Innovative approaches in spatio-temporal modeling: handling data collected by new technologies

    This thesis illustrates and puts into context two of the main research projects I worked on during my Ph.D. program, in collaboration with several national and international co-authors from "La Sapienza" and other prestigious universities. Both research lines concern the spatial and spatio-temporal analysis of geo-referenced datasets, which is of broad and current interest in the statistical research literature and its applications. My focus on this area of statistics was not premeditated before the start of the program. However, while pursuing my original research interests in the broader domain of Bayesian statistics, I realized there was an ever-increasing demand for viable and efficient statistical methods to analyze spatial and spatio-temporal data. This is a consequence of the extraordinary technological development that has affected data collection systems during the last few decades. Innovative, cutting-edge technologies yield new devices that can record and store data and information about the most diverse phenomena, possibly at a fine spatial scale and with high temporal resolution. Such capabilities were just a dream 20 or 30 years ago. Spatial statistics methods are rapidly evolving to face this surge of novel data structures in various application fields: geology, meteorology, ecology, epidemiology, economics, politics, and more. The first chapter of this thesis introduces the general idea behind spatial statistics, the branch of statistics devoted to analyzing and modeling temporal and spatial structure in time-indexed and/or geo-referenced datasets. A brief historical introduction to its developments is provided, starting from the first (sometimes unwitting) applications of its logic to practical and theoretical problems at the end of the nineteenth century. Many methods and techniques in this domain evolved independently, driven by the specific needs of the application fields in which they were developed.
    The historical excursus leads to a coarse (but reasonable) distinction into three main areas: continuous spatial variation, discrete spatial variation, and spatial point patterns. These areas present further facets within themselves, making spatial statistics an incredibly diverse and rich topic. A truly comprehensive review would require an entire book to be written and perhaps a lifetime to be thoroughly studied. Therefore, in the following chapters the discussion is focused on the specific areas and techniques used in the studies. Only those tools that proved valuable for the analyses performed in Alaimo Di Loro et al. (2021) and Kalair et al. (2020) are treated extensively. The second chapter focuses on the analysis of continuous spatial variation, that is, the modeling of outcomes varying continuously over some space. First, the most relevant properties of continuous spatial processes are introduced; second, some of the most common methodologies for performing spatial interpolation of the mean trend and stochastic modeling of the residuals are listed and sketched. In particular, the chapter digresses on spline regression as a valid technique to capture the first-order structure in spatial data. Soon after, geo-statistical methods and the Bayesian hierarchical framework are presented as invaluable tools for the simultaneous estimation of the first- and second-order structure of a process. Extension to spatio-temporal contexts is not as trivial as it may seem and must be approached with due care. An extensive discussion of the possible pitfalls and viable solutions is included in the same chapter. Finally, the problems arising in the analysis of big spatial data are highlighted in the last section, where the Nearest Neighbor Gaussian Process (NNGP; Datta et al. (2016a,b)) model is introduced as a highly scalable framework for providing full inference on massive spatial and spatio-temporal datasets.
    The third chapter includes an extended version of the paper Alaimo Di Loro et al. (2021), currently under review and available as a pre-print. It describes how the aforementioned technological development has strongly affected human tracking and monitoring capabilities, generating substantial interest in monitoring human activity. New non-intrusive wearable devices, such as wrist-worn sensors that monitor gross motor activity (miniature accelerometers), can continuously record individual activity levels, producing massive amounts of high-resolution measurements. Analyzing such data requires accounting for the spatial and temporal information on trajectories or paths traversed by subjects wearing such devices. Inferential objectives include estimating a subject's physical activity levels along a given trajectory, identifying trajectories that are more likely to produce higher levels of physical activity for a given subject, and predicting expected levels of physical activity in any proposed new trajectory for a given set of health attributes. We argue that the underlying process is more appropriately modeled as a stochastic evolution through time while accounting for spatial information separately. Building upon recent developments in this field, we construct temporal processes using directed acyclic graphs (DAGs) along the lines of the NNGP, include spatial dependence through penalized spline regression, and develop optimized implementations of the collapsed Markov chain Monte Carlo (MCMC) algorithm. The resulting Bayesian hierarchical modeling framework for the analysis of spatio-temporal actigraphy data proves able to deliver fully model-based inference on trajectories while accounting for subject-level health attributes and spatio-temporal dependencies.
    We undertake a comprehensive analysis of an original dataset from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study to formally ascertain spatial zones and trajectories exhibiting significantly higher physical activity levels. Suggestions for further extensions and improvements to the currently adopted methodology are discussed in the last section of the chapter. Chapter four marks a paradigm shift and introduces the basic theory and tools of spatial point pattern analysis. Some common probabilistic models for point processes are briefly discussed, with some of their properties and limitations highlighted. The rest of the chapter is entirely focused on the Hawkes process and its spatio-temporal extension. It is a particular kind of self-exciting point process that presents a strong inter-dependence structure. While conceived in Hawkes (1971a), its use in statistical applications was for a long time limited to the analysis of earthquake dynamics. The recent escalation of data at high temporal resolution, sometimes accompanied by spatial information, has favored its use in modeling event dynamics in diverse fields: finance, society, biology, etc. In particular, its defining properties are presented and state-of-the-art estimation methods for the spatio-temporal version are introduced. In the fifth chapter, the semi-parametric Hawkes process with a periodic background originally introduced in Zhuang and Mateu (2019) is outlined. While very recent, it has already proved very useful for modeling phenomena that are likely to present a cyclic pattern. It assumes that primary events occur as an effect of the background intensity, while secondary events are associated with the self-excitation effect.
    There are sound motivations justifying its use in the context of road accident dynamics: for example, excitation may occur when a driver, reacting to the disruption caused by one accident, triggers a subsequent accident upstream of the first one. The proposed framework is tested in two original applications on two sets of data: the first, somewhat preliminary, involves the modeling and analysis of road accidents that occurred on the urban road network of Rome, Italy; the second is a conclusive analysis, recently published in Kalair et al. (2020), conducted on a collection of road accidents that occurred on the M25 London Orbital in the United Kingdom. Adaptations of the original methodology to the road accident setting were deemed necessary in both cases to account for specific features of car accidents and the geometry of the underlying space. The final results permit a fruitful interpretation of the temporal and spatial background, which detects the typical commuting behavior of the Rome and London communities. The self-excitation component appears to have slightly different intensities in the two contexts, suggesting excitation mechanisms that vary between urban networks and motorways. Finally, the sixth chapter summarizes the main passages of the thesis, highlighting the previous chapters' original contributions. It also distills a take-home message: statistical modeling is of fundamental importance as a scientific tool to formulate and verify hypotheses, and this role must not be discouraged by new challenges and technological advancements.